library(tidyverse)
library(readxl)
KOMP Plasma Metabolomics Dataset
Mouse knockouts facilitate the study of gene functions. Often, multiple abnormal phenotypes are induced when a gene is inactivated. The International Mouse Phenotyping Consortium (IMPC) has generated thousands of mouse knockouts and cataloged their phenotype data. We have acquired metabolomics data from 220 plasma samples from 30 unique mouse gene knockouts and corresponding wild-type mice from the IMPC. To acquire comprehensive metabolomics data, we have used liquid chromatography (LC) combined with mass spectrometry (MS) for detecting polar and lipophilic compounds in an untargeted approach. We have also used targeted methods to measure bile acids, steroids and oxylipins. In addition, we have used gas chromatography GC-TOFMS for measuring primary metabolites. The metabolomics dataset reports 832 unique structurally identified metabolites from 124 chemical classes as determined by ChemRICH software
In this demo we will use only the metabolites quantified in the targeted assays. The belong to the chemical classes of bile acids, steroids and oxylipins
Here we get the data
load("../data/KOMP_data_targeted.RData")
We have now three files
The amount of info in the table is uncommon
colnames(metabolite_meta)
## [1] "CompoundID" "Order" "DataBaseID"
## [4] "Type" "Assay" "AnnotationApproach"
## [7] "CompoundName" "IKFirstBlock" "InChiKey"
## [10] "RetentionTime" "MZ" "SMILES"
## [13] "KEGGID" "PubChemID" "QC_RSD"
## [16] "BlankRatio" "AcylChain"
What is interesting for us is to see how much effort has been made to precisely identify the compounds. The problem of “naming” is typical of analytical chemistry and metabolomics.
metabolite_meta$InChiKey[1]
## [1] "RUDATBOHQWOJDD-BSWAIDMHSA-N"
Some of the identifier refer to publicly available databases. This is the general trend of this scientific discipline.
KEGGID refers to kegg so it can be used to link the metabolite to its position within metabolic networks.
The second tibble contains the description of the sample metadata
head(sample_meta_data)
## # A tibble: 6 × 4
## MouseID Genotype Gender Zygosity
## <dbl> <chr> <chr> <chr>
## 1 413741 Pebp1 Male Homo-
## 2 413742 Pebp1 Male Homo-
## 3 413745 Pebp1 Male Homo-
## 4 428689 Pebp1 Female Homo-
## 5 428693 Pebp1 Female Homo-
## 6 433002 Pebp1 Female Homo-
The field are auto explanatory. The MouseID is the code we need to associate the sample to the class
First of all we give a look to the experimental design, just to understand the numbers we are dealing with
table(sample_meta_data$Gender,sample_meta_data$Genotype)
##
## A2m Ahcy Atp5b Atp6v0d1 C8a Cdk4 Dhfr Dync1li1 G6pd2 Galc Gnpda1 Idh1
## Female 3 3 3 3 3 3 3 3 3 3 3 3
## Male 3 3 3 3 3 3 3 3 3 3 3 3
##
## Iqgap1 Lmbrd1 Mfap4 Mmachc Mvk Nek2 Npc2 Null Pebp1 Phyh Pipox Plk1
## Female 3 3 3 3 3 3 3 20 3 3 3 3
## Male 3 3 3 3 3 3 3 20 3 3 3 3
##
## Pmm2 Ptpn12 Pttg1 Rock1 Sra1 Ulk3 Ywhaz
## Female 3 3 3 3 3 3 3
## Male 3 3 3 3 3 3 3
head(DM[,1:20])
## # A tibble: 6 × 20
## CompoundID KOMP_413741 KOMP_…¹ KOMP_…² KOMP_…³ KOMP_…⁴ KOMP_…⁵ KOMP_…⁶ KOMP_…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CPD000001 124. 50.6 67.0 227. 101. 357. 151. 38.1
## 2 CPD000002 63.7 88.3 38.9 74.9 16.0 126. 238. 44.8
## 3 CPD000003 15.1 64.0 21.8 84.7 69.4 186. 19.2 183.
## 4 CPD000004 116. 549. 71.6 195. 526. 146. 295. 179.
## 5 CPD000005 NA NA NA 0.55 0.87 NA NA NA
## 6 CPD000006 NA 0.53 1 NA 1.49 6.93 NA 1.68
## # … with 11 more variables: KOMP_115996 <dbl>, KOMP_138110 <dbl>,
## # KOMP_138120 <dbl>, KOMP_190701 <dbl>, KOMP_143866 <dbl>, KOMP_143867 <dbl>,
## # KOMP_143869 <dbl>, KOMP_144064 <dbl>, KOMP_123951 <dbl>, KOMP_123953 <dbl>,
## # KOMP_487464 <dbl>, and abbreviated variable names ¹KOMP_413742,
## # ²KOMP_413745, ³KOMP_428689, ⁴KOMP_428693, ⁵KOMP_433002, ⁶KOMP_185262,
## # ⁷KOMP_115900
Since our mouse ID is a number without the leading “KOMP_” we fix this on the fly
DM <- DM %>%
as.data.frame() %>%
column_to_rownames("CompoundID") %>%
t(.) %>%
as_tibble(rownames = "Sampname") %>%
separate(Sampname, c("One","MouseID"),"_") %>% ## split the column
dplyr::select(-One) %>% ## remove the first piece
mutate(MouseID = as.numeric(MouseID))
Now the matrix looks like this
head(DM[,1:20])
## # A tibble: 6 × 20
## MouseID CPD000001 CPD000002 CPD000003 CPD000…¹ CPD00…² CPD00…³ CPD00…⁴ CPD00…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 413741 124. 63.7 15.1 116. NA NA 140. 15.4
## 2 413742 50.6 88.3 64.0 549. NA 0.53 325. 21.1
## 3 413745 67.0 38.9 21.8 71.6 NA 1 59.0 5.15
## 4 428689 227. 74.9 84.7 195. 0.55 NA 110 13.1
## 5 428693 101. 16.0 69.4 526. 0.87 1.49 318. 42.4
## 6 433002 357. 126. 186. 146. NA 6.93 115. 13.1
## # … with 11 more variables: CPD000009 <dbl>, CPD000010 <dbl>, CPD000011 <dbl>,
## # CPD000012 <dbl>, CPD000013 <dbl>, CPD000014 <dbl>, CPD000015 <dbl>,
## # CPD000016 <dbl>, CPD000792 <dbl>, CPD000793 <dbl>, CPD000794 <dbl>, and
## # abbreviated variable names ¹CPD000004, ²CPD000005, ³CPD000006, ⁴CPD000007,
## # ⁵CPD000008
To check that we organize the matrix in a different way also joining the concentration data with the sample metadata
DM_nest <- DM %>%
left_join(sample_meta_data) %>%
pivot_longer(starts_with("CPD"), names_to = "CompoundID", values_to = "I") %>%
group_by(CompoundID) %>%
nest()
This data structure highlights the role of each metabolite
head(DM_nest)
## # A tibble: 6 × 2
## # Groups: CompoundID [6]
## CompoundID data
## <chr> <list>
## 1 CPD000001 <tibble [220 × 5]>
## 2 CPD000002 <tibble [220 × 5]>
## 3 CPD000003 <tibble [220 × 5]>
## 4 CPD000004 <tibble [220 × 5]>
## 5 CPD000005 <tibble [220 × 5]>
## 6 CPD000006 <tibble [220 × 5]>
the intensities of each metabolite are stored in the “data” column
head(DM_nest$data[[1]])
## # A tibble: 6 × 5
## MouseID Genotype Gender Zygosity I
## <dbl> <chr> <chr> <chr> <dbl>
## 1 413741 Pebp1 Male Homo- 124.
## 2 413742 Pebp1 Male Homo- 50.6
## 3 413745 Pebp1 Male Homo- 67.0
## 4 428689 Pebp1 Female Homo- 227.
## 5 428693 Pebp1 Female Homo- 101.
## 6 433002 Pebp1 Female Homo- 357.
The new tibble can be efficiently used to inspect the data in a univariate perspective …
For example, the number of NAs in relation of the median intensity can be visualized as
DM_nest %>%
mutate(nnas = map_dbl(data,function(x) sum(is.na(x$I))), ## calculate the univariate statistics "on the fly"
medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>%
filter(nnas > 0) %>% ## we keep only features with at least one NA!
ggplot() + ## plot them
geom_point(aes(x = medI, y = nnas)) +
scale_x_log10() +
theme_light()
Which actually shows what we expected. High number of missing values are visible only with low median intensities … note the log scale in x.
This is telling us that there is a distinction of metabolites in two groups showing a marked difference in concentration
Is this effect dependent on the class of metabolite?
DM_nest %>%
left_join(metabolite_meta) %>%
mutate(nnas = map_dbl(data,~sum(is.na(.x$I))), ## calculate the univariate statistics "on the fly"
medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>%
filter(nnas > 0) %>% ## we keep only features with at leas on NA!
ggplot(aes(x = medI, y = nnas, col = Assay)) + ## plot them
geom_point() +
scale_x_log10() +
scale_color_brewer(palette = "Set2") +
theme_light()
That’s strange … the largest number of NAs is corresponding to lower intensities for the Bile Acids_Steroids and not for Oxylipins. This is curious
It is worth highlighting the name of the metabolites showing a number of NAs larger than 50 …
DM_nest %>%
left_join(metabolite_meta) %>%
mutate(nnas = map_dbl(data,~sum(is.na(.x$I))), ## calculate the univariate statistics "on the fly"
medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>%
filter(nnas > 0) %>% ## we keep only features with at leas on NA!
ggplot(aes(x = medI, y = nnas, col = Assay)) + ## plot them
geom_point(size = 2) +
geom_text(aes(label=ifelse(nnas>50,CompoundName,'')), size = 4, nudge_y = -3) +
scale_x_log10() +
scale_color_brewer(palette = "Set2") +
theme_light()
As we discussed in the lecture, it is difficult to decide which metabolites have too many missing values …
When we have a “limited” number of variables the best approach is to
plot the data. With the DM_nest we can do that quite
efficiently
To do that I create a custom plotting function which takes as input the tibble of the data and a potential title
myplot <- function(t,l = ""){
p <- t %>%
ggplot() +
geom_jitter(aes(x = Genotype, y = I, color = Gender, shape = Zygosity), width = 0.1) +
scale_y_log10() +
scale_color_brewer(palette = "Set1") +
theme_light() +
ggtitle(l) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p
}
myplot(DM_nest$data[[1]], "test")
And I recursively apply it to the nested data tibbles …
DM_nest <- DM_nest %>%
left_join(metabolite_meta %>% dplyr::select(CompoundID,CompoundName)) %>%
mutate(rawplots = map2(data,CompoundName, ~myplot(.x,.y)))
Now I can visualize (or save) the plot I want …
This one, in particular show the trend of one of the features with a large fraction of NAs
DM_nest$rawplots[[5]]
As it can be expected, progesterone is lower in males, and, most likely close to the detection limit. It is worth remarking that progesterone was the steroid showing the lower intensity. Its behavior is then not odd.
Let’s now look to PGD2
DM_nest$rawplots[[22]]
The trend of this metabolite is really wired. Its large fraction of
missing values do not seems to be associated to the intensity of the
signal
As far as I can see (but you could check better … ;-) ), the distribution of the missing data does not seems to be higly associated with the study factors.
Considering that we have 220 samples the level of missingness is never extremely severe, so when it will be needed imputation should not be something that strongly influence the
Something For you
The need of displaying the intensity in log scale is already suggesting that the distribution of the intensities is far from being normal. This is not unexpected in metabolomics. In real life what i do is to work almost invariably with log-transformed datasets.
For sure
We can follow the spirit of the previous plot by defining a function and using it to produce a full list of plots …
An alternative could be to unnest the data and rely on
faceting …
DM_nest %>%
ungroup() %>%
dplyr::select(CompoundName,data) %>%
unnest(data) %>%
ggplot(aes(sample = I)) +
stat_qq(aes(col = Gender, shape = Zygosity)) +
scale_color_brewer(palette = "Set1") +
stat_qq_line() +
facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
theme_light() +
theme(aspect.ratio = 1,
legend.position = "bottom")
Ok, the plot is “ugly” (maybe creating a set of plots as we did before could have been more pleasant), but is really informative.
DM_nest$rawplots[[50]]
The interpretation for that would, most likely require a chat with a good biochemist…
Let’s look to the global effect of a log transformation on the q-q plots
DM_nest %>%
ungroup() %>%
dplyr::select(CompoundName,data) %>%
unnest(data) %>%
ggplot(aes(sample = log10(I))) +
stat_qq(aes(col = Gender, shape = Zygosity)) +
scale_color_brewer(palette = "Set1") +
stat_qq_line() +
facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
theme_light() +
theme(aspect.ratio = 1,
legend.position = "bottom")
The situation largely improves, even if some compounds are still showing strange behaviours …
Something For you
The previous univariate journey should have been already pointing out some interesting aspects of the dataset.
The subsequent step is to look to visualize the multivariate structure of the data to spot larger scale patterns.
In order to do that it is necessary to impute the data matrix, substituting NAs with meaningful numbers.
My choice here is to work variable wise and replace the NAs with a random number drawn from an uniform distribution spanning from 0 to the minimum value measured for that variable
The rationale behind this choice is that
In my practical example imputation could be perfoed both in the
DM and in its nested version DM_nest. Since we
are going to do PCA, I’ll work on the unnested data matrix.
First of all I need a function to perform the imputation of a vector
myimputer <- function(v){
if (sum(is.na(v)) == 0) { ## If I do not have NAs please leave the vector unchanged
return(v)
} else {
napos <- which(is.na(v)) ## position of the NAs in the vector
newval <- runif(length(napos),0,min(v, na.rm = TRUE)) ## calculate the random numbers I should put in place of the NAs
out <- v
out[napos] <- newval
return(out)
}
}
Now we apply it to the full set of columns
DM_i <- DM %>%
mutate(across(starts_with("CPD"), ~myimputer(.x)))
As we have seen, log transformation is helpful …
DM_i <- DM_i %>%
mutate(across(starts_with("CPD"), ~log10(.x)))
And now PCA!
library(FactoMineR)
library(factoextra)
myPCA <- PCA(DM_i %>% dplyr::select(starts_with("CPD")), graph = FALSE)
fviz_pca_ind(myPCA,
habillage = factor(sample_meta_data$Gender),
geom = "point",
pointsize = 2,
axes = c(1,2)) +
scale_color_brewer(palette = "Set1")
Even if the amount of variability captured in this representation is
only the 30%, the previous plot shows the presence of a separation
between males and females.
This is not unexpected considering the results of our univariate investigation
What about the variables?
fviz_pca_biplot(myPCA,
habillage = factor(sample_meta_data$Gender),
geom = "point",
pointsize = 2,
axes = c(1,2)) +
scale_color_brewer(palette = "Set1")
The biplot here shows that females show an higher level of the large majority of the metabolites.
Since the partial separation is visible along PC1 we could visualize the variables which show the higher contribution to that direction
fviz_contrib(myPCA, "var", axes = 1)
These plots are really informative, let’s give a look inside the PCA object to see how these number could be get out.
This list contains the information about the samples (the scores) so it can be used to reconstruct the score plot. What one needs are the coordinates
str(myPCA$ind)
## List of 4
## $ coord : num [1:220, 1:5] 1.25 -2.52 -3.23 1.56 3.72 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:220] "1" "2" "3" "4" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ cos2 : num [1:220, 1:5] 0.0551 0.1763 0.1952 0.0628 0.1714 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:220] "1" "2" "3" "4" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ contrib: num [1:220, 1:5] 0.06 0.2419 0.3995 0.0924 0.5275 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:220] "1" "2" "3" "4" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ dist : Named num [1:220] 5.34 5.99 7.32 6.21 8.98 ...
## ..- attr(*, "names")= chr [1:220] "1" "2" "3" "4" ...
The dist and cos2 can be used to assess how
far a sample is from the PCA plane (or projection). This is important to
identify potential outliers.
The list
str(myPCA$var)
## List of 4
## $ coord : num [1:57, 1:5] 0.501 0.438 0.224 0.614 0.241 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ cor : num [1:57, 1:5] 0.501 0.438 0.224 0.614 0.241 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ cos2 : num [1:57, 1:5] 0.2515 0.1916 0.05 0.3773 0.0583 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ contrib: num [1:57, 1:5] 2.11 1.61 0.42 3.17 0.49 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
instead contains the info about the variables, their loadings and the correlation with the principal components.
To extract the previous information in a more programmatic way,
FactoMineR has a specific function
head(facto_summarize(myPCA, "var", axes = 1))
## name Dim.1 coord cos2 contrib
## CPD000001 CPD000001 0.50146696 0.2514691117 0.2514691117 2.113225335
## CPD000002 CPD000002 0.43771074 0.1915906960 0.1915906960 1.610035960
## CPD000003 CPD000003 0.22358750 0.0499913695 0.0499913695 0.420103399
## CPD000004 CPD000004 0.61422802 0.3772760659 0.3772760659 3.170446403
## CPD000005 CPD000005 0.24146089 0.0583033611 0.0583033611 0.489953374
## CPD000006 CPD000006 0.02785031 0.0007756396 0.0007756396 0.006518101
which returns as a table the content of the PCA object.
By using the content of the myPCA object it is possible to
“reconstruct” the PCA plots with ggplot and this can be
handy if specific representation of the data are needed.
Suppose, for example, that you would like to construct a more informative variable importance plot which shows the correct metabolite name and also color the bars according to the chemical/metabolic class of the compound.
facto_summarize(myPCA, "var", axes = 1) %>%
as_tibble() %>%
dplyr::select(name, `Dim.1`) %>%
left_join(metabolite_meta, by = c("name" = "CompoundID")) %>%
arrange(`Dim.1`) %>%
mutate(CompoundName = factor(CompoundName, levels = CompoundName)) %>%
ggplot() +
geom_point(aes(x = CompoundName, y= `Dim.1`, col = Assay), size = 2) +
geom_segment(aes(x = CompoundName, yend = `Dim.1`, col = Assay, xend = CompoundName, y = 0)) +
scale_color_brewer(palette = "Set2") +
coord_flip() +
theme_light()
Which can be of help for the interpretation of the results
Something For you
What we did so far is called exploratory data analysis. To write the paper we should also find the variables which are showing a significant effect of the design factors.
Notes
For illustrative purposes let’s focus on the Female-Male Comparison.
## here i nest the imputed data matrix, most likely I could do the testing also in the "non" imputed one
## we also add the gender of the animal
DM_i_nest <- DM_i %>%
pivot_longer(starts_with("CPD"), names_to = "CompoundID", values_to = "I") %>%
left_join(sample_meta_data) %>%
nest(-c(CompoundID))
Now we run the tests on the metabolites
DM_i_nest <- DM_i_nest %>%
mutate(wilcoxon = map(data, ~wilcox.test(I ~ Gender, data = .x, exact = FALSE)))
If we want to see one of the tests …
DM_i_nest$wilcoxon[[10]]
##
## Wilcoxon rank sum test with continuity correction
##
## data: I by Gender
## W = 7374.5, p-value = 0.005036
## alternative hypothesis: true location shift is not equal to 0
Now we extract the p-values
DM_i_nest <- DM_i_nest %>%
mutate(ps = map_dbl(wilcoxon, ~.x$p.value))
As we have seen yesterday, a first idea on the presence of “relevant” differences in the two classes is the distribution of the p-values
hist(DM_i_nest$ps)
Create a set of data with permuted intensities
DM_i_nest <- DM_i_nest %>%
mutate(perm_data = map(data, function(t) t %>% mutate(I = sample(I)))) %>%
mutate(perm_wilcoxon = map(perm_data, ~wilcox.test(I ~ Gender, data = .x, exact = FALSE))) %>%
mutate(perm_ps = map_dbl(perm_wilcoxon, ~.x$p.value))
hist(DM_i_nest$perm_ps)
This is clearly not uniform and it speaks of the presence of a significant fraction of differences between males and females.
Now we would like to rank the variables on the base of their contrast on the two classes … but as we know it is not correct to use the p-value for that.
We need to calculate the effect size! Now I’ll use the Cohen’s d. A word of caution here, this type of effect size is parametric … but I’ve been stressing that non parametric analysis here should be preferred. Right. A non parametric measure of the effect size should be better. A possibility could be to calculate the ratio between the difference between the two medians and divide it by the larger interquartile range. You could be creative there …
Let’s stick to Cohen’s d. In the same time I’ll also calculate the p-values corrected for multiplicity
library(effsize)
DM_i_nest <- DM_i_nest %>%
mutate(pcorr = p.adjust(ps, method = "bonferroni")) %>%
mutate(cohend = map_dbl(data, ~cohen.d(I ~ Gender, data = .x)$estimate))
Now we use the effect size and the p value to create a volcano plot
library(plotly)
ggplotly(DM_i_nest %>%
left_join(metabolite_meta %>% select(CompoundID,CompoundName)) %>%
ggplot(aes(text=CompoundName)) +
geom_hline(yintercept = 5, lty = 2, col = "red") +
geom_point(aes(x = cohend, y = -log10(ps)), fill = "orange", size = 3, pch = 21) +
theme_bw(), tooltip="text")
Now we focus on the compound which are significant at the 0.01 level and we sort the results by effect size …
markers <- DM_i_nest %>%
filter(pcorr < 0.01) %>%
arrange(desc(abs(cohend))) %>%
left_join(metabolite_meta)
head(markers)
## # A tibble: 6 × 25
## CompoundID data wilcoxon ps perm_d…¹ perm_…² perm_ps pcorr cohend
## <chr> <list> <list> <dbl> <list> <list> <dbl> <dbl> <dbl>
## 1 CPD000829 <tibble> <htest> 8.22e-30 <tibble> <htest> 0.353 4.69e-28 1.66
## 2 CPD000830 <tibble> <htest> 5.21e-21 <tibble> <htest> 0.733 2.97e-19 1.38
## 3 CPD000807 <tibble> <htest> 1.84e-19 <tibble> <htest> 0.265 1.05e-17 1.30
## 4 CPD000808 <tibble> <htest> 4.90e-17 <tibble> <htest> 0.638 2.79e-15 1.18
## 5 CPD000005 <tibble> <htest> 1.92e-14 <tibble> <htest> 0.144 1.09e-12 1.17
## 6 CPD000806 <tibble> <htest> 8.43e-12 <tibble> <htest> 0.555 4.81e-10 0.768
## # … with 16 more variables: Order <dbl>, DataBaseID <dbl>, Type <chr>,
## # Assay <chr>, AnnotationApproach <chr>, CompoundName <chr>,
## # IKFirstBlock <chr>, InChiKey <chr>, RetentionTime <dbl>, MZ <dbl>,
## # SMILES <chr>, KEGGID <chr>, PubChemID <chr>, QC_RSD <dbl>,
## # BlankRatio <dbl>, AcylChain <chr>, and abbreviated variable names
## # ¹perm_data, ²perm_wilcoxon
This is the list of “biomarkers”, let’s plot one of them …
DM_nest$rawplots[[54]]
Which exactly shows what we expect.
Notes
The association between the intensity of two metabolites in the samples is often considered an interesting evidence in the perspective of the interpretation. Metabolites are indeed linked in metabolic pathways and their concentrations are not independent
As we have seen yesterday, however, correlation can be there only by chance so care have to be taken to jump from the observation of an association to the conclusion of a causal relation.
These caveats are particularly true in the case of untargeted metabolomics assays and can be somehow taken more easily as far as targeted metabolomics is concerned.
The best tool to inspect variable correlation in R is the
pairs plot. With pairs one really sees the
experimental data and it is easy to spot subgroups and outlying
samples.
Unfortunately, the pairs visualization is manageable only with a small number of variables (something like 15). For this reason it is a resource which can be exploited after the identification of the most relevant biomarkers
## here I get the names of the first 5
focuscp <- markers %>%
slice(1:5) %>%
pull(CompoundID)
## This is a littel of magic ...
mypalette <- c("#F24D2970","#1C366B70")
names(mypalette) <- unique(sample_meta_data$Gender)
par(pty="s")
DM_i %>%
dplyr::select(all_of(focuscp)) %>%
as.matrix(.) %>%
pairs(., col = mypalette[sample_meta_data$Gender], pch = 19)
Notes
Something For you